The past few years have witnessed the great success of a new family of paradigms, so-called folksonomy, 
which allows users to freely associate tags with resources and manage them efficiently. In order to uncover the 
underlying structures and user behaviors in folksonomy, in this paper we propose an evolutionary hypergraph 
model to explain the emerging statistical properties. The present model introduces a novel mechanism in which one 
can not only assign tags to resources, but also retrieve resources via collaborative tags. We then compare the 
model with a real-world dataset: Del.icio.us. The present model shows considerable agreement with the 
empirical data in the following aspects: power-law hyperdegree distributions, negative correlation between clustering 
coefficients and hyperdegrees, and small average distances. Furthermore, the model indicates that most tagging 
behaviors are motivated by labeling resources with tags, and that tags play a significant role in effectively retrieving 
interesting resources and making acquaintance with congenial friends. The proposed model may shed some 
light on the in-depth understanding of the structure and function of folksonomy. 

PACS numbers: 89.20.Hh, 89.65.-s, 05.65.+b, 85.75.-k 



I. INTRODUCTION 

Networks provide us with a powerful and versatile tool to
recognize and analyze complex systems, where nodes represent
individuals and links denote the relations between them.
Recently, much effort has been devoted to understanding
the structure, evolution and dynamics of complex networks
fl-llt]. The advent of Web 2.0 and its affiliated applications
brings a new form of user-centric paradigm which cannot be
fully described by pre-existing models on unipartite or
bipartite networks. One such example is the user-driven emerging
phenomenon, folksonomy, which allows users to upload
resources (bookmarks, photos, movies, etc.) and freely assign
them user-defined words, so-called tags. Folksonomy
requires no specific skills for users to participate, broadens the
semantic relations among users and resources, and achieved
immediate success within a few years. Presently, a
large number of such applications can be found online, such as
Del.icio.us lH], Flickr m, CiteULike lH], etc. With the help of
those platforms, users can not only store their own resources
and manage them with collaborative tags, but also look into
other users' collections to find what they might be interested
in by simply keeping track of the baskets with tags. Unlike
traditional information management methods, where words (or
indices) are normally pre-defined by experts or administrators,
e.g. the library classification systems, a tagging system
allows users to create arbitrary tags, even ones that do not exist in
dictionaries. Therefore, those user-defined tags can reflect user
behaviors and preferences, with which users can easily make
acquaintance, collaborate and eventually form communities
with others who have similar interests 131.

Up to now, a variety of research has been devoted to
understanding the structure and dynamic process of folksonomy.
Golder et al. studied the usage patterns of collaborative
tagging systems and classified seven kinds of tag functions [10],
which is very helpful for better understanding both
user behaviors and tagging purposes. In addition, methods based
on keywords or PACS numbers have been put forward to
reveal the underlying structure of co-authorship and citation
networks ifTH \\%. Furthermore, much effort has been made
to explain how folksonomy emerges. Cattuto et al. [15]
investigated the dynamics of an open-ended system with a
memory-based Yule-Simon model. The model considered the aging
effect of tags, as well as the frequency of tag occurrence. In
Ref. ITill, they tried to model folksonomy in the form of
tripartite graphs.

Recently, hypergraph theory listl, which allows a hyperedge
to connect an arbitrary number of vertices instead of two as in
regular graphs, has provided a promising way to
better understand a wide range of real systems. Up to now,
it has found applications in Personalized Recommendation
UBJEHIj, Population Stratification [19], Cellular Networks [20],
etc. Besides, the definition is well suited
to uncovering the underlying usage patterns and essential
structures of folksonomies. Ghoshal et al. [21] proposed a
random hypergraph model to represent the ternary relationship,
where a hyperedge consists of one user, one resource and
one tag, and reproduced many properties of folksonomy with
the model. Zlatić et al. [22] extensively defined a number
of useful topological features based on the hypergraph
representation, which can be considered as a standard tool for
understanding the structure of tagged networks.

In this paper, we propose a hypergraph model to illustrate
the emergence of some statistical properties in folksonomy,
including degree distribution, clustering coefficients and
average distance between nodes. We consider two typical user
tagging behaviors: (i) one might become aware of a resource via
web surfing or word-of-mouth propagation, and then save it
to his/her own favorite collection and annotate it with several
tags of related topics for efficient management and retrieval;
(ii) s/he might first pick one or several compound tags,
and then choose one possible resource from the retrieval
results. Recently, a considerable amount of research has
focused on the former motivation [23,24], while the latter
has received comparatively little attention. Actually, tags are able to
provide more relevant results owing to their simple yet essential
properties of collaboration and semantics. Fig. 1 shows those
two different kinds of mechanisms.

In this model, users can manage resources with
collaborative tags, and find resources by tags via serendipitous
browsing. We compare the model to one real-world dataset,
Del.icio.us, and find good agreement between them.

FIG. 1: (Color online) Illustration of two typical user tagging
behaviors: (a) the user finds a resource (e.g. a book) via web surfing
and annotates it with three tags for further use; (b) s/he collects one
or several books by filtering out unrelated information with the tag
'book'.



II. MODELING TRIPARTITE HYPERGRAPHS 

We begin our study with some definitions related to the
tripartite hypergraph that we will analyze. In this paper, we use the
tripartite hypergraph representation given by Ref. [21], where
a hyperedge simply consists of one user, one resource and one
tag. Fig. 2 gives a visual illustration of such a structure.

In a tripartite hypergraph, the network G can be briefly
described by G=(V,H), where V denotes the set of vertices and H
represents the set of hyperedges. V = U ∪ R ∪ T, where U, R and
T represent the sets of users, resources and tags respectively,
and H ⊆ U × R × T; the number of hyperedges is usually much
smaller than the number of all possible triangles. Correspondingly,
the Del.icio.us dataset we collected has 15009 users, 2431190
resources and 325120 distinct tags, which constitute 11739998
hyperedges.
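
A minimal sketch of this representation, assuming hypothetical toy identifiers rather than records from the dataset: hyperedges are stored as (user, resource, tag) triples, and the vertex sets U, R and T are recovered from them. The helper names `vertex_sets` and `fill_ratio` are illustrative only.

```python
from typing import Iterable, Set, Tuple

Hyperedge = Tuple[str, str, str]  # (user, resource, tag)


def vertex_sets(H: Iterable[Hyperedge]) -> Tuple[Set[str], Set[str], Set[str]]:
    """Recover the user, resource and tag sets U, R, T from the hyperedges."""
    H = list(H)
    U = {u for u, _, _ in H}
    R = {r for _, r, _ in H}
    T = {t for _, _, t in H}
    return U, R, T


def fill_ratio(H: Set[Hyperedge]) -> float:
    """|H| divided by the number of all possible triangles |U| * |R| * |T|."""
    U, R, T = vertex_sets(H)
    return len(H) / (len(U) * len(R) * len(T))


# Hypothetical toy folksonomy: 2 users, 2 resources, 2 tags, 3 hyperedges.
H = {("u1", "r1", "t1"), ("u1", "r2", "t1"), ("u2", "r1", "t2")}
print(vertex_sets(H), fill_ratio(H))
```

In a real folksonomy the fill ratio is tiny, which is the sense in which H is much smaller than the set of all possible triangles.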



A. Model 

We are mainly interested in the effect of tagging
behaviors and the role of tags in networks. Therefore, we
fix the distribution of user activities according to the empirical
data. The model can then be described as follows:




FIG. 2: (Color online) A hyperedge illustration of the basic unit in 
our network. There are three types of vertices in each hyperedge 
(represented as a triangle), depicted by one red circle, one green rect- 
angle and one blue triangle which respectively represent a user, a 
resource and a tag in folksonomy. 



• At each time step, pick a random user u according to
the given distribution of user activities.

• User u either chooses a resource with probability
p, or selects a tag with probability 1-p.

• If u is activated from the aspect of resources, s/he will
select an existing resource in the system with
probability 1-p1 according to its popularity, or introduce
a completely new resource with probability p1.
S/he will then annotate it with a few tags. For simplicity,
in this paper we only consider the case in which u assigns
exactly one tag to the selected resource r. Thus, u
chooses the tag from his/her own vocabulary with
probability p2 according to how many times s/he has adopted
it, or from the resource vocabulary with probability
p3 according to how many times it has been associated
with the target resource, or introduces a new tag with
probability 1-(p2+p3) if s/he does not find a suitable or
personalized tag to describe r.

• If u decides to find a relevant resource on a specific
topic, s/he will choose a random tag t based on its
popularity, and then save one of the relevant resources
according to how many triangles they have appeared in
together with t.
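
The steps above can be sketched as a short simulation. This is an illustrative reading of the rules, not the authors' code, and it rests on several assumptions that the text does not fix: the system is seeded with one hyperedge (0, 0, 0), user activities default to uniform weights, an empty vocabulary falls back to creating a new tag, and a user may pick a resource s/he has already collected. The function name `simulate` and its arguments are hypothetical.

```python
import random
from collections import defaultdict


def simulate(steps, n_users, p, p1=0.3, p2=0.45, p3=0.45, activity=None, seed=0):
    """Grow a list of (user, resource, tag) hyperedges, one per time step."""
    rng = random.Random(seed)
    activity = activity or [1.0] * n_users      # assumed uniform if not given
    users = range(n_users)
    edges = [(0, 0, 0)]                         # assumed seed hyperedge
    n_res, n_tag = 1, 1
    # "Urns" hold one entry per hyperedge, so a uniform draw from an urn
    # is a draw proportional to popularity (preferential attachment).
    res_urn, tag_urn = [0], [0]
    user_tags = defaultdict(list, {0: [0]})     # tags used by each user
    res_tags = defaultdict(list, {0: [0]})      # tags attached to each resource
    tag_res = defaultdict(list, {0: [0]})       # resources co-occurring with each tag
    for _ in range(steps):
        u = rng.choices(users, weights=activity)[0]
        if rng.random() < p:                    # resource-first behavior
            if rng.random() < p1:
                r = n_res                       # introduce a new resource
                n_res += 1
            else:
                r = rng.choice(res_urn)         # popular existing resource
            x = rng.random()
            if x < p2 and user_tags[u]:
                t = rng.choice(user_tags[u])    # from the user's vocabulary
            elif p2 <= x < p2 + p3 and res_tags[r]:
                t = rng.choice(res_tags[r])     # from the resource's vocabulary
            else:
                t = n_tag                       # introduce a new tag
                n_tag += 1
        else:                                   # tag-first behavior
            t = rng.choice(tag_urn)             # popular tag
            r = rng.choice(tag_res[t])          # resource co-occurring with t
        edges.append((u, r, t))
        res_urn.append(r)
        tag_urn.append(t)
        user_tags[u].append(t)
        res_tags[r].append(t)
        tag_res[t].append(r)
    return edges
```

For example, `simulate(10000, 100, 0.8)` returns 10001 hyperedges whose resource and tag hyperdegrees can then be tallied and compared with the empirical distributions.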

In this model, a new hyperedge (u,r,t) is produced either
from the perspective of resources or of tags at each time step.
When one tries to give a tag to a certain resource, s/he might
choose a tag s/he has used before, or pick one tag
recommended by the system. A new tag is added if no
appropriate tag is available to describe that resource. Thus a
tag-growth mechanism is considered in the present model. We
then repeatedly run the model until a sufficient number of
hyperedges is obtained. Moreover, we simply assume that
only one hyperedge emerges each time a user is activated, which
is not the case in real networks. However, such a simplified
assumption helps us examine the effects of different
tagging behaviors on the emergence of folksonomies. To evaluate
our model, we measure the following quantities (Fig. 3 gives
a detailed description of these quantities):
(i) hyperdegree distribution: defined as the proportion of
nodes having each hyperdegree, where hyperdegree is defined as
the number of hyperedges that a regular node participates in.

(ii) clustering coefficients: defined as the ratio of the actual
number of hyperedges to the maximal possible number of
hyperedges that a regular node could have.

(iii) average distance: defined as the average shortest path
length between two random nodes in the whole network.
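
The three quantities can be sketched in a few lines. The edge list below is hypothetical: it is chosen only to be consistent with the numbers quoted in the caption of Fig. 3 (two users, four resources, three tags, U2 with six hyperedges) and is not the actual figure. The sketch assumes that node names are unique across the three types.

```python
from collections import Counter, defaultdict, deque


def hyperdegree_distribution(edges, pos):
    """Fraction of nodes per hyperdegree; pos 0/1/2 = user/resource/tag."""
    degree = Counter(e[pos] for e in edges)
    hist = Counter(degree.values())
    return {k: n / len(degree) for k, n in sorted(hist.items())}


def user_clustering(edges, u):
    """Actual hyperedges of u over the maximal number (#resources * #tags)."""
    mine = {(r, t) for v, r, t in edges if v == u}
    resources = {r for r, _ in mine}
    tags = {t for _, t in mine}
    return len(mine) / (len(resources) * len(tags))


def distance(edges, a, b):
    """Shortest path length; nodes sharing a hyperedge are at distance 1."""
    adj = defaultdict(set)
    for e in set(edges):
        for x in e:
            adj[x].update(y for y in e if y != x)
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, d = queue.popleft()
        if node == b:
            return d
        for nxt in adj[node] - seen:
            seen.add(nxt)
            queue.append((nxt, d + 1))
    return None  # a and b are not connected


# Hypothetical hyperedges consistent with the caption of Fig. 3.
edges = [
    ("U1", "R1", "T1"),
    ("U2", "R2", "T1"), ("U2", "R2", "T2"), ("U2", "R3", "T2"),
    ("U2", "R3", "T3"), ("U2", "R4", "T3"), ("U2", "R4", "T1"),
]
print(hyperdegree_distribution(edges, 0))  # user hyperdegrees 1 and 6
print(user_clustering(edges, "U2"))        # 6 / (3 * 3)
print(distance(edges, "U2", "R1"))         # path U2 - T1 - R1
```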

Since we are mainly interested in how the tagging behaviors
influence the emergence of folksonomies, we fix the other
parameters and investigate the effect of p. In the following analysis,
we set p1=0.3 and p2=p3=0.45 as constants.
FIG. 3: (Color online) An illustrative hypergraph consisting of two
users, four resources and three tags. Taking user U2 and resource R1
as an example, the measurements are as follows: (i) U2
participates in six hyperedges, which means its hyperdegree is 6; (ii) U2 is
directly connected to three resources and three tags, which means it
can have at most 3x3=9 hyperedges. Thus its clustering
coefficient equals 6/9=0.667, where 6 is its hyperdegree; (iii) the
shortest path from U2 to R1 is U2 - T1 - R1, which indicates that the
distance between U2 and R1 is 2.



B. Hyperdegree Distribution

According to Ref. [21], hyperdegree is defined as how many
triples a regular node takes part in. We denote by P(k_u),
P(k_r) and P(k_t) the fractions of users, resources and tags
with a given hyperdegree, respectively. In the model, P(k_u) is
directly derived from the empirical data. Therefore, we mainly
focus on the dynamics of P(k_r) and P(k_t). Firstly, we can write
down the rate equation for the distribution of resources (in order
to avoid confusion over the time symbol, we use l to represent the
time in the following descriptions):

    P(k_r, l+1) = p \{ p_1 P(k_r, l) + (1 - p_1) [ (1 - p_r(k_r, l)) P(k_r, l)
                  + (1 - \delta_{k_r,1}) p_r(k_r - 1, l) P(k_r - 1, l) ] \}
                  + (1 - p) \{ (1 - \delta_{k_r,1}) p_{tr}(k_r - 1, l) P(k_r - 1, l)
                  + [1 - p_{tr}(k_r, l)] P(k_r, l) \} + \delta_{k_r,1},        (1)

where P(k_r,l) denotes the hyperdegree distribution of resources
at time l, p_r(k_r,l) is the probability that u picks an
uncollected resource with hyperdegree k_r according to its
popularity, p_tr(k_r,l)=k_r/l is the probability to choose a
resource with hyperdegree k_r from a random tag t at time l, and
δ_ij is the Kronecker delta. The first brace shows the choice
described in the model, where the first term is the probability of
adding a new resource and the second term is the probability of
selecting an existing resource. The second brace depicts the
evolutionary process from the aspect of tags. However, since it is
not easy to identify the distribution of each individual's
uncollected resources, we approximately assume that this
distribution is directly proportional to that of the system, that is,

    p_r(k_r, l) = \frac{k_r}{l}.        (2)

Integrating Eq. (1) and Eq. (2), together with the stationary
condition \lim_{l \to \infty} P(k_r, l) / l = P(k_r), we have:

    P(k_r) = \frac{(1 - p p_1)(k_r - 1)}{1 + (1 - p p_1) k_r} P(k_r - 1),        (3)

for k_r>1. When k_r=1, Eq. (1) can be simplified to:

    P(1) = \frac{1}{2 - p p_1}.        (4)

Combining Eq. (3) and Eq. (4), we can recursively obtain the
final solution:

    P(k_r) = a_1 \frac{\Gamma(k_r) \Gamma(1 + a_1)}{\Gamma(k_r + 1 + a_1)},        (5)

where a_1 = 1/(1 - p p_1) and Γ is the Gamma function.
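
As a numerical cross-check of the recursion against the closed form, the sketch below assumes the normalized reading of Eq. (5), P(k_r) = a_1 Γ(k_r) Γ(1 + a_1) / Γ(k_r + 1 + a_1) with a_1 = 1/(1 - p p_1), together with P(1) = 1/(2 - p p_1) from Eq. (4). The function names are illustrative.

```python
from math import exp, lgamma, log


def a1(p, p1):
    """Assumed form of the parameter a_1 = 1 / (1 - p * p1)."""
    return 1.0 / (1.0 - p * p1)


def resource_distribution_closed(k, a):
    """Closed form a * Gamma(k) * Gamma(1 + a) / Gamma(k + 1 + a), via log-Gamma."""
    return a * exp(lgamma(k) + lgamma(1 + a) - lgamma(k + 1 + a))


def resource_distribution_recursive(kmax, p, p1):
    """Iterate P(k) = q (k - 1) / (1 + q k) * P(k - 1), starting from P(1)."""
    q = 1.0 - p * p1
    P = [0.0, 1.0 / (2.0 - p * p1)]  # index 0 is unused padding
    for k in range(2, kmax + 1):
        P.append(q * (k - 1) / (1 + q * k) * P[k - 1])
    return P


p, p1 = 0.8, 0.3
a = a1(p, p1)
rec = resource_distribution_recursive(200, p, p1)
print(rec[1], resource_distribution_closed(1, a))
# Local log-log slope at large k; it approaches 1 + a_1 for a power-law tail.
slope = log(resource_distribution_closed(1000, a)
            / resource_distribution_closed(2000, a)) / log(2)
print(slope, 1 + a)
```

Working with `lgamma` instead of the Gamma function itself avoids overflow at large hyperdegrees.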

Analogously, we can also write down the tag hyperdegree
distribution in the form of a rate equation:

    P(k_t, l+1) = [p (p_2 + p_3) + (1 - p)] [ p_r(k_t - 1, l) P(k_t - 1, l) (1 - \delta_{k_t,1})
                  + (1 - p_r(k_t, l)) P(k_t, l) ]
                  + p (1 - p_2 - p_3) P(k_t, l) + \delta_{k_t,1},        (6)

where p_r(k_t,l) is the probability of picking a random tag
with hyperdegree k_t at time l. According to the present model,
there are four mechanisms that drive the growth of tags: (i)
user u selects one tag from his/her own vocabulary with
probability p2; (ii) u chooses one word from the set of tags
associated with the target resource with probability p3; (iii) a
new tag is introduced with probability 1-p2-p3; (iv) u selects
an interesting tag t from all the possible candidates and saves
a resource that is relevant to t. Eq. (6) expresses the
integrated effect of those mechanisms on tag evolution.

We make an assumption similar to Eq. (2), namely that the
individual's tag hyperdegree distribution is directly
proportional to that of the system:

    p_r(k_t, l) = \frac{k_t}{l}.        (7)

Following the same procedure as for Eq. (3) and Eq. (4),
the solution reads:

    P(k_t) = a_2 \frac{\Gamma(k_t) \Gamma(1 + a_2)}{\Gamma(k_t + 1 + a_2)},        (8)

where a_2 = 1/(1 - p(1 - p_2 - p_3)).

FIG. 4: (Color online) The hyperdegree distributions of the three types of nodes: (a) the empirical cumulative hyperdegree distribution of users, 
which follows a stretched exponential distribution P(k_u) ∝ exp[-(k_u/k_0)^c], where k_0 is a constant. The inset gives the fitting result of the 
exponent c=0.64 according to the method used in [27]; (b) the empirical, simulation and analytical results of the resource hyperdegree distribution, 
following a power law P(k_r) ∝ k_r^{-φ} with φ=2.28; (c) the empirical, simulation and analytical results of the tag hyperdegree distribution, following 
a power law P(k_t) ∝ k_t^{-ψ} with ψ=2.13. The simulation and analytical results of (b) and (c) are obtained with p=0.8. 
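
As a worked check of the exponents implied by the Gamma-function solutions, the block below uses the standard large-k asymptotics of a ratio of Gamma functions. It assumes the forms a_1 = 1/(1 - p p_1) and a_2 = 1/(1 - p(1 - p_2 - p_3)), which are a reading of the damaged formulas rather than a quotation, and the parameter values p = 0.8, p_1 = 0.3, p_2 = p_3 = 0.45 used in the text.

```latex
\frac{\Gamma(k)}{\Gamma(k + 1 + a)} \sim k^{-(1 + a)} \quad (k \to \infty)
\;\Longrightarrow\;
\phi = 1 + a_1, \qquad \psi = 1 + a_2 .

% Numerical values for p = 0.8, p_1 = 0.3, p_2 = p_3 = 0.45:
a_1 = \frac{1}{1 - 0.8 \times 0.3} = \frac{1}{0.76} \approx 1.316,
\qquad \phi \approx 2.32,

a_2 = \frac{1}{1 - 0.8 \times 0.1} = \frac{1}{0.92} \approx 1.087,
\qquad \psi \approx 2.09 .
```

Under these assumptions the predicted exponents are close to the fitted values 2.28 and 2.13 quoted in the caption of Fig. 4.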

Fig. 4 shows the simulation, analytical and empirical
results of hyperdegree distributions in both the real and modeled
networks. Fig. 4(a) shows the empirical data of users'
cumulative hyperdegree distribution, which follows a stretched
exponential distribution [25, 26]. Fig. 4(b) and Fig. 4(c) show
good agreement between the empirical observations and the
analytical results, while the inconsistency in Fig. 4(b) might be
caused by our assumption, which results in a comparatively large
number of resources with small hyperdegrees. Note that p = 0.8
indicates that most tagging actions arise from the resource
aspect in folksonomies.

In addition, we measure the effect of different values of p on
the hyperdegree distributions. In Fig. 5(a), the resource
hyperdegree distribution is in good agreement with the empirical
data only when p exceeds 0.7, whereas the slope of the tag
hyperdegree distribution does not change much with various
values of p. This might be caused by two reasons: (i) the
evolution of folksonomy is driven primarily by assigning tags to
the target resource, which is consistent with a large value of p;
(ii) when p is small, the large number of resources with small
hyperdegree remarkably affects the fitting result.
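
The slopes discussed here are obtained by a least-squares fit. A minimal sketch of such a fit on log-log axes is given below; it is a plain ordinary-least-squares line through (log k, log P(k)), which is one common reading of "Least Squares Method" and not necessarily the exact procedure used for the figures. The function name `loglog_exponent` is illustrative.

```python
import math


def loglog_exponent(ks, ps):
    """Least-squares estimate of alpha in P(k) ~ k^(-alpha) on log-log axes."""
    xs = [math.log(k) for k in ks]
    ys = [math.log(p) for p in ps]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx  # the fitted slope is -alpha


# Synthetic check on an exact power law with exponent 2.28.
ks = list(range(1, 200))
ps = [k ** -2.28 for k in ks]
print(loglog_exponent(ks, ps))
```

On noisy empirical data the result depends on the fitted range, which is consistent with the remark that an excess of low-hyperdegree resources can distort the fitted slope.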



FIG. 5: (Color online) The slopes of the hyperdegree distributions for
different values of p, for analytical and simulation results:
(a) the variation of φ; (b) the variation of ψ. Both
distributions show the scale-free property under disparate values of p,
that is, P(k) ∝ k^{-α}, where α refers to φ and ψ in (a) and (b),
respectively. φ and ψ are measured by the Least Squares Method (LSM).



C. Clustering Coefficients 

Clustering in a network measures the likelihood that two
neighbors of a given node are connected to each
other. Watts and Strogatz [28] introduced the clustering
coefficient to measure the amount of clustering for a given
node in normal unipartite networks. However, this definition
is not fully compatible with the hypergraph case, since a
regular node connects to two other different types of nodes. Thus,
we adopt the definition of the user clustering coefficient
illustrated in Fig. 3:

    C_u = \frac{k_u}{R_u T_u},        (9)

where k_u is the hyperdegree of user u, R_u is the number of
resources that u collects and T_u is the number of tags that u
possesses. The above definition measures the fraction of possible
pairs present in the neighborhood of u. A larger C_u indicates
that u collects resources on more similar topics, which might also
show that u concentrates more on personalized or special
topics. The hyperdegree-based clustering coefficient is then
averaged over all the nodes with the same hyperdegree.

FIG. 6: (Color online) The clustering coefficients versus collapsed hyperdegrees: (a) the user clustering coefficient versus collapsed user 
hyperdegree; (b) the resource clustering coefficient versus collapsed resource hyperdegree; (c) the tag clustering coefficient versus collapsed 
tag hyperdegree. In (c), the empirical data is shown in log bins in order to alleviate the fluctuations resulting from insufficient data, which 
would otherwise obscure its statistical property. All three plots are obtained with p=0.8. 

In order to compute C_u, we shall consider the evolutionary
dynamics of T_u, the number of tags used by the selected
user, as well as the dynamics of T, the current number of tags
existing in the system. We can write the differential equations:

di 
^[PP3(1-P1)(1-^) 
+P(1-P2-P3)(1-^) 
To'''        (10)

where T_0 is the total number of tags we initially set in the
model and L is the total number of designed simulation steps.
Since we assume that only one tag is allowed to be assigned
at each time step, the hyperdegrees of users and resources
degenerate to the bipartite case. Therefore, we get k_u = R_u.
Thus, Eq. (9) can be rewritten as:

    C_u = \frac{1}{T_u}.        (11)

Unfortunately, it is not easy to get an explicit expression
from Eq. (10). Instead, we find the numerical solution by
combining Eq. (10) and Eq. (11). Fig. 6(a) shows good
consistency among the empirical, simulation and numerical
results.

Analogously, we can also write the dynamics of C_r:

' ^ = ^p[p2{l - f ) + (1 - p2 - P3)(l - t )] 
^ f =p(l-p2-p3)(l-§), 

    \frac{dk_r}{dl} = \frac{k_r}{l},
    C_r = \frac{k_r}{U_r T_r},        (12)

where k_r is the resource hyperdegree, T_r is the number of tags
attached to resource r, and U_r is the number of users who
have collected r. Fig. 6(b) shows the numerical solution for Eq.
(12), as well as the empirical and simulation results.

The dynamics of C_t are as follows:

    \frac{dR}{dl} = p p_1,
    C_t = \frac{k_t}{U_t R_t},        (13)

where k_t is the tag hyperdegree, U is the number of users, which
is fixed in the model, U_t is the number of users who have
used tag t, R is the number of resources existing in the
system, and R_t is the number of resources labeled with t. Fig.
6(c) shows the numerical solution for Eq. (13), as well as the
empirical and simulation results. All three plots in Fig. 6
show negative correlations between clustering coefficient and
hyperdegree in both the real-world and modeled networks.
This might indicate the hierarchical structure of tripartite
hypergraphs [31], and suggests that users with larger hyperdegrees
have more diverse interests, and vice versa.

D. Average Distance 

Another important quantity is the distance, D, between a
random pair of nodes in a network. The average
distance, ⟨D⟩, measures the efficiency of retrieving a target node
in a network. Taking a friendship network as an example, ⟨D⟩ is
given by the average shortest path length between a
random user and another arbitrary user. Therefore, ⟨D⟩
assesses how easily and effectively a user can make the
acquaintance of others in a given friendship network.

However, in the case of the tripartite hypergraph, there are three
different types of regular nodes. Therefore, the shortest path
length is defined as the minimal number of hyperedges that must be
traversed to go from one vertex to another. Fig. 7 shows ⟨D⟩
between any two types of vertices. Fig. 7(a) and Fig. 7(b)
show the average distances of the bipartite network and the
hypergraph structure of Del.icio.us, respectively. We can see that:
(i) tags significantly shorten ⟨D⟩ for any pair of nodes
in comparison with the bipartite case. For example, ⟨D⟩ of
the user-user pair is reduced from 3.587 to 2.205, ⟨D⟩ of the
user-resource pair from 3.947 to 2.676, and ⟨D⟩ of the
resource-resource pair from 4.641 to
3.386. These considerable improvements might indicate that
tags play an important role in information retrieval; (ii) in
Fig. 7(b), the magnitudes strictly follow the order D_u <
D_r < D_t in both general and special cases. For example, we
have D_uu < D_ur < D_ut for users, D_ur < D_rr < D_rt
for resources, and D_ut < D_rt < D_tt for tags. The similar
pattern of those orders might imply that Del.icio.us is a
user-centric system, so that we can find information more easily
through users than through resources or tags. Besides, the main
purpose of tagging is to manage resources more efficiently and
effectively, which is coherent with the comparatively large value
of p in previous sections. Fig. 7(c) reproduces this phenomenon
with p=0.8 in the model. Furthermore, we study the
effect of different values of p on the distances. Fig. 7(d)
shows that the order remains almost unchanged whatever
value p takes. Additionally, Fig. 7(d) also indicates that
all the distances decrease monotonically as p decreases,
which might suggest that the more often we use tags, the
more effectively we can find target information.

FIG. 7: (Color online) The average distances of bipartite and tripartite networks. Since the data is huge, we calculate ⟨D⟩ by randomly 
sampling pairs of nodes until a stationary value is obtained: (a) the average distances of user-user (D_uu), user-resource (D_ur) and resource-
resource (D_rr) versus the number of samplings in the bipartite network, ignoring the tag information of Del.icio.us; (b) the average distances 
of user-user, user-resource, user-tag (D_ut), resource-resource, resource-tag (D_rt) and tag-tag (D_tt) versus the number of samplings in the 
tripartite hypergraph of Del.icio.us; (c) the average distances of user-user, user-resource, user-tag, resource-resource, resource-tag and tag-tag 
versus the number of samplings in the tripartite hypergraph produced by the present model. All the curves converge fast with just a small 
number of samplings, which indicates a small-world property in both bipartite and tripartite networks; (d) the stationary average distances 
for different values of p in the modeled network. 
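
The sampling estimate of the average distance can be sketched as follows. This is an assumption-laden illustration rather than the procedure actually used: two vertices are treated as neighbors when they share a hyperedge, the number of samples is fixed instead of being increased until the estimate becomes stationary, and unreachable or identical pairs are skipped. All function names are hypothetical.

```python
import random
from collections import defaultdict, deque


def build_adjacency(edges):
    """Two vertices are neighbors if they share at least one hyperedge."""
    adj = defaultdict(set)
    for u, r, t in edges:
        nodes = (("u", u), ("r", r), ("t", t))  # tag each node with its type
        for x in nodes:
            adj[x].update(y for y in nodes if y != x)
    return adj


def bfs_distance(adj, source, target):
    """Minimal number of hyperedges traversed from source to target."""
    seen, queue = {source}, deque([(source, 0)])
    while queue:
        node, d = queue.popleft()
        if node == target:
            return d
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None


def sampled_average_distance(edges, kind_a, kind_b, n_samples, seed=0):
    """Estimate <D> between node types kind_a and kind_b ('u', 'r' or 't')."""
    rng = random.Random(seed)
    adj = build_adjacency(edges)
    group_a = sorted(n for n in adj if n[0] == kind_a)
    group_b = sorted(n for n in adj if n[0] == kind_b)
    total, count = 0, 0
    for _ in range(n_samples):
        a, b = rng.choice(group_a), rng.choice(group_b)
        if a == b:
            continue                      # skip identical endpoints
        d = bfs_distance(adj, a, b)
        if d is not None:                 # skip disconnected pairs
            total += d
            count += 1
    return total / count if count else float("nan")


# Hypothetical toy data: two users who share only the tag T1.
toy = [("U1", "R1", "T1"), ("U2", "R2", "T1")]
print(sampled_average_distance(toy, "u", "u", 50))
print(sampled_average_distance(toy, "u", "t", 50))
```

In the toy data the shared tag is the only bridge between the two users, which mirrors the observation that tags shorten the distances relative to the bipartite case.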



III. CONCLUSION AND DISCUSSION 

In this paper, we have proposed an evolutionary hypergraph
model to study the dynamical properties of social tagged
networks, so-called folksonomies. The present model assumes
that there are two typical tagging behaviors based on a
preferential attachment mechanism: (i) assigning tags to users'
favorite resources; (ii) saving resources that are relevant to
interesting tags. The resulting tripartite hypergraph shows good
agreement with a real-world network, Del.icio.us, in the following
aspects: (i) power-law hyperdegree distributions are
generated for resources and tags, which indicates a heterogeneous
topology; (ii) the average clustering coefficients
decay with increasing hyperdegree, which may indicate a
hierarchical structure of tripartite hypergraphs; (iii) the
average distances between vertices of the hypergraph are
smaller than those in the corresponding bipartite networks
without tags; (iv) the relatively small average distance
indicates a small-world property, which facilitates the
serendipitous discovery of interesting contents and congenial
companions; (v) all the above properties show relatively high
consistency with the empirical data at a comparatively large
value of p=0.8, which suggests that the majority of actions are
motivated by the first tagging behavior. Consequently, this model
quantitatively reveals the accessorial yet significant role that
tags play in folksonomies.

However, despite the good agreement with real data in reproducing
several features, it is not easy to fully uncover the
mechanisms dominating the emergence of folksonomy. This
paper only provides a starting point for understanding the
underlying motivations that give rise to the variety of intricate
properties of such new paradigms. The present model assumes
that only one hyperedge is allowed to emerge at each time
step, which is not the case in real systems. The tag
co-occurrence ifisi I29I1 and social cognitive imitation
mechanisms [32] can be taken into account to improve the proposed
model.